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ABSTRACT 

This paper describes the duchamp source finder, a piece of software designed 
to find and describe sources in 3-dimensional, spectral-line data cubes, duchamp 
has been developed with Hi (neutral hydrogen) observations in mind, but is widely 
applicable to many types of astronomical images. It features efficient source detection 
and handling methods, noise suppression via smoothing or multi-resolution wavelet 
reconstruction, and a range of graphical and text-based outputs to allow the user to 
understand the detections. This paper details some of the key algorithms used, and 
illustrates the effectiveness of the finder on different data sets. 
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1 INTRODUCTION 

Large-scale blind surveys in astronomy provide a wealth of 
information about the Universe. They are the best means 
to gain knowledge in an unbiased way about populations of 
sources, particularly if the only limitation is the flux limit 
imposed by observational constraints. For determining the 
nature of the local Universe, neutral hydrogen (Hi) surveys 
provide a view of the bulk of the baryonic mass, whilst also 
giving insights into the dark matter content. The H I Parkes 
All-Sky Survey (HIPASS, Barnes et al. 2001) is the best 
example of a truly large-scale blind H I survey, yielding cat- 
alogues of 4315 sources over all the sky south of 5 — +2° 
(Meyer et al. 2004) and a further 1002 sources between 
+2° < S < +25° 30' (Wong et al. 2006), plus catalogues of 
the 1000 brightest galaxies (Koribalski et al. 2004) and high- 
velocity clouds (Putman et al. 2002). These results provide 
a great deal of information on the H I properties of galax- 
ies, the Hi mass function, the local velocity fields, and the 
cosmic Hi mass density. 

Blind surveys require a good method of detecting ob- 
jects, to build up a reliable catalogue that is as complete 
as possible. Such detection techniques must cope with the 
presence of noise that provides spurious sources and hides 
real but faint ones. The HIPASS survey featured a source 
detection procedure that required a large amount of human 
input - each source was verified several times by eye. As one 
increases the size of the spectral cubes, in terms of spatial 
size and the number of channels, this approach quickly be- 
comes unfeasible, and a reliable automated source finding 
algorithm becomes essential to extract the required science 
catalogues from surveys. 
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An example of an instrument driving such develop- 
ment is the Australian Square Kilometre Array Pathfinder 
(ASKAP, DeBoer et al. 2009). This will be a survey tele- 
scope, built in the radio-quiet SKA candidate site of West- 
ern Australia, that will be able to do, amongst other things, 
H I surveys over large areas of sky and a wide redshift range 
(due to its 30 square degree field of view and 300 MHz band- 
width). Ten "Survey Science Projects" (SSPs) have been 
identified as the major projects to consume at least 75% 
of ASKAP's initial five years, and two of these are extra- 
galactic H I surveys: the "Widefield ASKAP L-band Legacy 
All-sky Blind surveY" (WALLABY, Koribalski & Staveley- 
Smith 2009), an all-sky survey to z ~ 0.26 that is one of two 
top-ranked projects (along with the "Evolutionary Map of 
the Universe" (EMU, Norris et al. 2011) all-sky continuum 
survey), and "Deep Investigations into Neutral Gas Origins" 
(DINGO, Meyer 2009), a deeper survey over a smaller area 
to z ~ 0.5. (Note that these are not the only SSPs with 
source finding requirements, they are simply the two surveys 
for extragalactic H I emission.) The large data rates that will 
be produced by ASKAP will necessitate largely automated 
processing, and so a robust, reliable source finder will be 
essential if the surveys are to meet their science goals. 

The anticipated demands of ASKAP, combined with 
the realisation that there was no publicly-available generic 
3D source-finder suitable for such work, prompted the de- 
velopment of the duchamp source-finding software package. 
This is a stand-alone piece of software, not closely associ- 
ated with any particular instrument or survey, but aims to 
be a general-purpose 3D source finder. The algorithms, how- 
ever, are being evaluated by the ASKAP development team 
and the Survey Science teams for inclusion in the processing 
pipelines for ASKAP. 

duchamp has been available for some years now, and 
has been used with a variety of data from single-dish (Wong 
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2007; McDonnell et al. 2008; Putman et al. 2009; Hsu et al. 
2011; Purcell et al. 2012) and interferometer (Johnston- 
Hollitt et al. 2008; Lah et al. 2009; Popping & Braun 2011) 
telescopes, as well as an aid to visualisation techniques 
(Fluke et al. 2010; Hassan et al. 2011). This paper aims 
to demonstrate the utility of duciiamp by presenting some 
of its key development aspects: the source detection tech- 
niques, statistical calculations and noise-suppression tech- 
niques, as well as memory management and software de- 
sign. It also describes some of the testing techniques that 
have been used in its evaluation. All examples of duchamp 
processing use version 1.1.13 of the software, the most recent 
version at time of writing. 



2 THE duchamp SOURCE FINDER 

2.1 Philosophy and Design of DUCHAMP 

The philosophy I have used in developing duchamp has been 
to make things as friendly as possible for the user. This has 
resulted in straightforward user input and a range of graph- 
ical and text-based outputs to enable the user to accurately 
judge the quality of their detections. 

The aim of duchamp is to provide the user with in- 
formation about the location of the sources of interest in 
the data. It does not make any prior assumptions about the 
nature of those sources (e.g. assuming they are Gaussian- 
shaped), nor does it perform any profile fitting or analysis 
beyond direct measurements of the provided data. The aim 
is to provide the user with the location of the interesting 
sources. To that end, mask cubes can be written to high- 
light the detected pixels, allowing the user to do their own 
post-processing. 

duchamp has been designed for use primarily with 
spectral-line cubes, and this colours the thinking in much of 
the design. Spectral-line cubes have two spatial dimensions 
and one spectral dimension representing frequency, wave- 
length or velocity, and this distinction is reflected in a num- 
ber of areas, such as reference to "channel maps", being 
2D spatial slices at a particular spectral value. Despite this, 
duchamp is readily applicable to two-dimensional, and even 
one-dimensional data. 

The other main assumption made by duchamp is that 
the sources are sparsely populated throughout a cube that 
is dominated by noise. If this assumption does not hold (for 
example, in the case of observations of diffuse Hi from the 
Milky Way), processing will still be able to be done, but any 
statistics duchamp calculates will be biased by the presence 
of signal. The noise calculations duchamp employs are de- 
tailed in Sec. 4. 



2.2 Obtaining and using the DUCHAMP software 

duchamp is open-source (GPL licensed) software, written 
in C++, and is publicly available for download from the 
duchamp web site 1 . There, one can find the source code, an 
extensive User's Guide and links to required packages. These 
are: CFITSIO 2 (Pence 1999), to handle operations on FITS 
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Figure 1. A flow-chart indicating the key processing steps used 
by DUCHAMP. The pre-processing options are user-selectable, and 
are typically done before the determination of the threshold used 
in searching. 



files; wcslib 3 (described in Calabretta & Greisen 2002), to 
correctly convert between pixel and world coordinates; and 
PGPLOT 4 , to produce the graphical output, duchamp should 
run in any UNIX/Linux-based environment, and is being 
maintained to ensure this continues to be the case. To this 
end, feedback is encouraged should any problems arise when 
using it. 



2.3 Summary of processing 

This section summarises the different steps undertaken by 
duchamp in finding sources in an image. Certain elements 
of the processing will be explained in detail in later sections. 
Fig. 1 indicates the key processing steps that duchamp will 
perform on an image. 

duchamp requires two inputs. The first is an image or 
cube, in FITS format. The image merely needs to conform 
to the FITS standard (Pence et al. 2010) and have an IM- 
AGE extension. FITS files produced by common reduction 



1 http:/ /www. atnf.csiro.au/peoplc/Matthew.Whiting/Duchamp 3 http:/ /www. atnf.csiro.au/pcoplc/mcalabre/WCS/indcx. html 

2 http://hcasarc.gsfc.nasa.gov/docs/software/fitsio/fitsio.html 4 http://www.astro.caltech.edu/~tjp/pgplot/ 
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packages, such as miriad and CASA, should be suitably com- 
patible. The second input is a set of parameters that govern 
the execution. These parameters allow a large degree of flex- 
ibility both in choosing different processing algorithms and 
in fine tuning many aspects of the searching and output. 

Following the reading of the image file, which can be 
the entire image or a specified subsection (that is, the user 
can nominate pixel ranges for each dimension of the im- 
age), a user-selectable set of pre-processing can be done 
prior to searching. This can be simple inversion of the im- 
age (to search for negative features), a basic spectral band- 
pass subtraction to remove a continuum ripple or some other 
large-scale spectral variation, or either smoothing or multi- 
resolution wavelet reconstruction to enhance faint sources. 
These latter two approaches are described in Sec. 6. 

The threshold is then set according to the input param- 
eters as either a flux or signal-to-noise threshold - Sec. 4 pro- 
vides details of how the threshold is determined. Sources are 
then extracted from the image, and merged to form objects. 
These are then parameterised - if relevant, the inversion and 
bandpass subtraction are reversed prior to parametrisation. 
The detection, merging and parametrisation steps are de- 
scribed in Sec. 5. 

The final object catalogue is then ready, and can be 
returned in a variety of formats, including ASCII text and 
VOTable/XML. Appendix B gives an example of the ASCII 
output file, duchamp also provides a range of graphical out- 
put to assist the user in understanding the nature of the de- 
tected sources. These include spatial maps showing the lo- 
cation and size of each source, and individual plots for each 
source showing the both the spectrum and the Oth-moment 
map. 

Additional outputs include FITS images containing the 
mask indicating the location of detected sources, or the 
smoothed or reconstructed image (these FITS images will 
have their header information formatted in the same way 
as the input image). These allow post-processing to extract 
individual sources, or fit particular functions to detected ob- 
jects as dictated by the particular science of interest to the 
user. 



3 REPRESENTATION OF DETECTED 
OBJECTS 

To keep the memory usage manageable for the largest range 
of input data possible, duchamp implements specialised 
techniques for storing the location of detected objects. A 
naive method could be to store each single pixel, but this 
results in a lot of redundant information being stored in 
memory. 

To reduce the storage requirements, the run-length en- 
coding method is used for storing the spatial information. In 
this fashion, an object in 2D is stored as a series of "runs", 
encoded by a row number (the y- value), the starting col- 
umn (the minimum a;- value) and the run length (£ x : the 
number of contiguous pixels in that row connected to the 
starting pixel). A single set of (y,x,£ x ) values is called a 
"scan" . A two-dimensional image is therefore made up of a 
set of scans. An example can be seen in Fig. 2. Note that the 
object shown has fourteen pixels, and so would require 28 
integers to record the positions of all pixels. The run-length 
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Figure 2. An example of the run-length encoding method of 
storing pixel information. The scans used to encode the image 
arc listed alongside the relevant row. The pixels are colour-coded 
by nominal pixel values, but note that the pixel values themselves 
do not form part of the encoding and are not kept as part of the 
object class. 

encoding uses just 18 integers to record the same informa- 
tion. The longer the runs are in each scan, the greater the 
saving of storage over the naive method. 

A 3D object is stored as a set of channel maps, with a 
channel map being a 2D plane with constant z-value. Each 
channel map is itself a set of scans showing the (x, y) position 
of the pixels. 

4 STATISTICAL ANALYSIS AND 
THRESHOLD DETERMINATION 

4.1 Thresholds in DUCHAMP 

The fundamental step in source finding is the application of 
a threshold, as this determines whether a given voxel is part 
of a source (and therefore of some interest), or part of the 
background, duchamp uses a single threshold for the entire 
cube, leading to an output catalogue with a uniform selec- 
tion criterion. This approach lends itself best to data that 
has uniform noise, but this need not be the case: duchamp 
has features that are able to remove some background struc- 
ture (such as large-scale spectral baseline features). The abil- 
ity to vary the threshold in response to the local noise would 
allow deeper searching where the sensitivity warrants, and 
such features are part of development under way for the 
ASKAP source-finding pipeline. 

The threshold for duchamp can be given as a flux value 
directly, or it can be expressed as a signal-to-noise level (a 
so-called "na" threshold). In this case, the value of a needs 
to be measured from the image statistics. Either the entire 
set of image pixels are used, or a subsection can be specified 
(via the parameter file input) that specifies which pixels to 
measure the statistics from. 

This latter option allows the user to, for instance, mea- 
sure the noise level in a region of the cube known to be free 

5 "Voxel" refers to a "volume element" or "volumetric pixel" - 
the 3D equivalent of the 2D picture element or pixel. 



© 2009 RAS, MNRAS 000, 1-16 



4 Matthew Whiting 



of sources (for example, a frequency or velocity range in a 
spectral cube in which no sources are expected). It may be, 
however, that no such region can be found, and that the 
threshold has to be based on noise statistics estimated from 
data containing sources (i.e. voxels that are not pure noise) . 
duchamp provides mechanisms to achieve this, either by us- 
ing robust methods to estimate the noise, or by setting the 
threshold in an alternative way to the "na" approach. These 
methods are detailed in the following sections. 

4.2 Robust techniques 

When searching for small objects in a noisy background de- 
tection thresholds are often set at some multiple of the stan- 
dard deviation of the image background. However, direct 
measurements of this value can be biased by the presence 
of bright pixels, yielding overestimates. Furthermore, these 
bright pixels are often in the sources of interest or some 
additional artefact (such as interference), rather than the 
background, and so should not contribute to the background 
calculation. 

It can be preferable, then, to calculate the noise proper- 
ties via robust methods, duchamp makes use of the median 
absolute deviation from the median (henceforth MADFM) 
as a proxy for the standard deviation. Although computa- 
tionally expensive (it requires the data to be at least par- 
tially sorted twice 6 ), it is very robust against bright pixels. 
It is best used in the situation where the data in question 
is dominated by a large number of noise pixels, with a rela- 
tively small number of brighter pixels present. It should be 
noted that the standard error of the median is 1.253 times 
that of the mean, and so can be more susceptible to fluctu- 
ations of the sampling, but with well-behaved noise this is 
acceptable. 

Note that the assumption duchamp makes about the 
nature of the data, that there are relatively isolated sources 
embedded in data dominated by noise, is important here. 
The median and MADFM will still be biased by the presence 
of bright non-noise pixels, as what will be interpreted as the 
centre of the distribution will be displaced from the centre of 
the noise pixel distribution. The bias, however, only depends 
on the number of non-noise pixels compared to the total 
size of the array under consideration, and not on the actual 
values of the bright pixels. 

To compute the MADFM, s, one first calculates the 
median value of the array, and then the median value of the 
absolute difference between the pixel values and the median: 

s = mod (\xi — med(xi)\) (1) 

For a pure Gaussian distribution of values, the MADFM 
does not give the same value as the standard deviation, but 
can be scaled easily to determine what the standard devi- 
ation of that distribution is. This scaling can be calculated 
analytically. The Normal, or Gaussian, distribution for mean 
H and standard deviation a has a density function 

6 Only partial sorting is required, since the median requires just 
the middle clcment(s). An algorithm such as nth_element from 
the C++ Standard Template Library is suitable. 



The median, m, is the middle of the distribution, defined for 
a continuous distribution by 

/m poo 
f(x)dx = / f(x)dx. 
-oo J m 

From symmetry (since f(x — //) = f(n — x)), we quickly 
see that for the continuous Normal distribution, m = fj,. 
We consider the case henceforth of fi = 0, without loss of 
generality. 

To find the MADFM, s, we find the distribution of the 
absolute deviation from the median, and then find the me- 
dian of that distribution. This distribution is given by 

g(x) — distribution of \x\ 

= f(x) + f{-x), x > 

V H(J 

So, the median absolute deviation from the median, s, is 
given by 

PS poo 

I g(x)dx = / g(x)dx. 

JO J s 

If we use the identity 

r e- x2,2 " 2 dx= 
Jo 

we find that 

J™ e-* 2/2 ° 2 dx = - jT e~ x2/2 ° 2 dx. 

Hence, 

/V* 2/2CT2 dx-A/^78 = 0, 
Jo 

which gives s = s[2 erf _1 (0.5)o- = 0.6744888cr. Thus, to esti- 
mate a for a Normally distributed data set, one can calculate 
s, then divide by 0.6744888 (or multiply by 1.4826042) to 
obtain the correct estimator. 

This conversion can be compared to solutions quoted 
elsewhere. For instance, Meyer et al. (2004) use the MADFM 
to estimate the noise in a cube, then quote a "cube noise" pa- 
rameter defined by Clearly this "cube noise" should 
not be interpreted as the standard deviation of the noise 
(despite being called Rms cubc in the HICAT catalogue) - it 
is in fact only 82% of the standard deviation as calculated 
above. 

These robust techniques can be used in duchamp to 
provide accurate estimates of the noise background. When 
thresholds are defined according to a multiple of the stan- 
dard deviation, the MADFM is always converted to an 
equivalent standard deviation using the conversion factor 
obtained above. Users are cautioned, however, that even 
these robust techniques can be biased by non-noise signal. 
In the case that the data being examined is purely noise plus 
some positive signal (or artefacts), then the median will not 
fall in the middle of the noise distribution, but slightly to 
the positive side, and similarly for the MADFM. For typical 
data, however, where the real signals are sparsely populated 
and outnumbered by noise voxels this bias will be small, and 
certainly smaller than the bias arising from normal statistics 
(mean and standard deviation). 

It may also be the case that the noise is not precisely 
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Gaussian. For instance, Popping & Braun (2011) consider 
WSRT interferometer images and find an excess number of 
negative pixels over that expected from an ideal Gaussian 
distribution. These are ascribed to interferometry artefacts 
in the image, which are expected to be symmetric about zero 
flux due to the purely relative nature of interferometric data 
when auto-correlations are not included. In practice, if one 
knows, or can estimate the scaling between MADFM and a, 
then it is straightforward to scale the requested threshold 
such that the appropriate na level is applied. These points 
highlight the need for users to have some understanding of 
the nature of the data being searched. 

4.3 False discovery rate technique 

Although one can estimate the standard deviation more ac- 
curately with the robust techniques, the problem of false 
discoveries will still be present. A technique has been pre- 
sented (Miller et al. 2001; Hopkins et al. 2002) that attempts 
to control the false discovery rate (FDR) in the presence 
of noise. This fixes the false discovery rate (given by the 
number of false detections divided by the total number of 
detections) to a certain value a (e.g. a = 0.05 implies 5% 
of detections are false positives). In practice, an a value is 
chosen, and the ensemble average FDR (i.e. (FDR)) when 
the method is used will be less than a. One calculates p 
- the probability, assuming the null hypothesis is true, of 
obtaining a test statistic as extreme as the pixel value (the 
observed test statistic) - for each pixel (in duchamp, Nor- 
mal statistics are assumed), and sorts them in increasing 
order. One then calculates d where 

f ja 1 

d = max <J : Pj < > , 

j I c N N J 

and then rejects all hypotheses whose p- values are less than 
or equal to Pd- (So a Pi < Pd will be rejected even if Pi ^ 
ja/cNN.) Note that "reject hypothesis" here means "accept 
the pixel as an object pixel" (i.e. we are rejecting the null 
hypothesis that the pixel belongs to the background). 

The cat value here is a normalisation constant that de- 
pends on the correlated nature of the pixel values. If all the 
pixels are uncorrelated, then cjv = 1. If JV pixels are cor- 
related, then their tests will be dependent on each other, 
and so cjv = Hopkins et al. (2002) consider real 

radio data, where the pixels are correlated over the beam. 
For the case of three-dimensional spectral data, the beam 
size would be multiplied by the number of correlated chan- 
nels. This number can be specified via the input parameter 
set 7 , and so a user can set it according to their knowledge 
of the cube being searched. This can be important if the 
cube is known to have been smoothed prior to searching, or 
if the telescope backend results in some correlation between 
neighbouring channels. 

The theory behind the FDR method implies a direct 
connection between the choice of a and the fraction of detec- 
tions that will be false positives. These detections, however, 
are individual pixels, which undergo a process of merging 
and rejection, and so the fraction of the final list of detected 
objects that are false positives will often be much smaller 
than a. 

7 Using the parameter FDRnumCorChan. 



5 DETECTION OF OBJECTS 
5.1 Searching strategy 

The problem of identifying sources in a two-dimensional im- 
age, by means of a single threshold, has well defined solu- 
tions. Lutz (1980) presented an algorithm for locating ob- 
jects in an image with a single raster-scan pass through the 
image. Such algorithms are possible because of the well- 
nested nature of objects in two dimensions: on a given row 
of an image, if object A lies between two portions of object 
B, then all of object A lies between those two portions. 

The problem of 3-dimensional source extraction is, in 
general, a more complex one. The extra dimension means 
that the well-nested property no longer applies, and so dis- 
tinct objects can be intertwined while still remaining dis- 
tinct. This makes it hard for a simple raster-scanning (or 
equivalent) algorithm to be applied. 

The approach duchamp uses has two aspects: mul- 
tiple lower-dimensional searches, followed by merging of 
the detected sources to produce complete three-dimensional 
sources. The simplest search duchamp can do is to treat 
each spatial pixel as a distinct spectrum, search with a sim- 
ple ID search routine, and combine the objects afterwards. 
The alternative (actually the default) is to treat each fre- 
quency channel as a two-dimensional image, and search it 
using the algorithm of Lutz (1980). The algorithm performs 
a raster scan of the image, processing one horizontal line at 
a time and builds up a list of objects by keeping track of 
which objects in each row are connected (in an 8-fold man- 
ner) to objects on the previous row. Objects from different 
images are then combined afterwards (see Sec. 5.2). 

duchamp has been developed for three-dimensional 
data, and even though we are using two- or one-dimensional 
searching, we know that the objects we are interested in 
will be three dimensional (that is, extended in all three di- 
rections). Ideally one would want to make use of this fact 
to help identify sources. A good way to do this is to use 
the smoothing or wavelet reconstruction approaches intro- 
duced in Sec. 6. Generally, smoothing in a direction perpen- 
dicular to the direction of the searching will respond well 
to 3D structures. For instance, if a 3D source is smoothed 
spectrally, then its brightness in each 2D channel map will 
be enhanced relative to the background due to the flux in 
neighbouring channels. Searching in 2D will then be more 
sensitive than a search without the prior smoothing. 

A single detection threshold is used at this point, effec- 
tively treating each voxel in a binary manner. Once a source 
list has been established (see Sec. 5.2), the objects can be 
"grown" to a secondary threshold if required. This enables 
the detection of all faint features provided they have at least 
one (or some number - the minimum size of a detection is a 
user-selectable parameter) voxel above the primary thresh- 
old. 



5.2 Merging of sources 

To form the final catalogue, objects found in individual 
channel maps or spectra need to be merged. Merging takes 
place if a pair of objects are judged to be close according to 
criteria determined by the user. Either the objects must be 
adjacent, meaning there is at least one pair of voxels (one 
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voxel from each object) that are adjacent in one direction, 
or they must have a separation less than a given number of 
voxels (which can differ between spatial and spectral sepa- 
rations) . 

The merging is done in two stages. When a single search 
of a 2D plane or a ID spectrum is complete, each detected 
object (stored as a set of scans) is added to a master list of 
detections. It is first compared to all those present in the list, 
and if it is close to one (according to the criteria defined by 
the user) the two objects are merged. No other comparisons 
are made at this point (to save time). If it is close to none, 
it is added to the end of the list. 

Once all the searching has been completed, the list 
needs to be examined for any further pairs of close objects. 
Although an initial comparison has been done, there will be 
unexamined pairs present. For instance, consider objects A 
and B, which are not close, and a new object C that is close 
to both A & B. When C is added to the list, it is first com- 
pared to A, found to be close, and combined. The list then 
has the combined AC object and B, which are both close. 
A second comparison needs to be made to combine these 
two. This secondary merging stage looks at successive pairs 
of objects from the list. If they are close, the second object 
is added to the first and removed from the list. This ensures 
that all possible combinations are found. 

The way the comparison is done is to first look at the 
pixel ranges (minimum & maximum in each dimension) to 
see if they overlap or lie suitably close according to the above 
criteria. This is a quick way of ignoring objects that should 
not be merged without doing a detailed examination. If they 
do, then the individual scans (refer Sec. 3) from common 
channels in each object are examined until a pair is found 
that satisfy the criteria for closeness. Note that this can be 
done directly on scans, without needing to look at individual 
pixels. 

If growing of sources is requested, this is done after 
the merging step. A second merge is then performed to en- 
sure that no duplicate detections are present, due to sources 
growing over each other. 

5.3 Parametrisation of sources 

Once the unique set of sources has been found, duchamp 
measures a basic set of parameters for them. The location 
of the source is measured in three ways: the peak position, 
or the voxel containing the maximum flux value; the "aver- 
age" position, being the average of all pixel values in each 
dimension 

where N is the number of pixels; and the "centroid" posi- 
tion, being the flux-weighted average of all pixels in each 
dimension 

f^,^,^) (3) 

\ ft FT rT / 

where Ft = 2 i s the total flux of all detected pixels. 

All positions and spectral locations are given in both 
pixel and world coordinates, making use of the WCSLIB 8 li- 

8 http: / /www. atnf.csiro.au/people/mcalabre/WCS/indcx.html 



brary (described in Calabretta & Greisen 2002). The sizes of 
the sources are reported: in the spatial direction, this is just 
the extent of the detected source (although not every pixel 
within this extent is necessarily detected - if the closeness 
criteria permit it, there can be voxels below the threshold 
within this range). 

In the spectral direction the parameters W50 and W20 
are provided, being the width at 50% and 20% of the peak 
flux. These are measured on the integrated spectrum (i.e. 
the spectra of all detected spatial pixels summed together), 
and are defined by starting at the outer spectral extent of the 
object (the highest and lowest spectral values) and moving 
in or out until the required flux threshold is reached. The 
widths are then just the difference between the two values 
obtained. The full spectral width of the detection (WVEL) is 
also reported - this is analogous to the spatial extent, being 
the extent of the detected source in the spectral direction. 

The flux of the source is given by the peak flux and both 
the "total flux", defined above as Ft, and the "integrated 
flux" Fi, or the total flux integrated over the spectral extent 
and corrected for the beam: 

Fl = — — 

where B — ira/3 /41n(2) is the area of a beam of major and 
minor axes a and /3 (in pixels), and Avi is the spectral width 
of each voxel. 

These parameters are intended as a basic set, as 
DUCHAMP has not been optimised for any particular survey. 
Those wanting more detailed or science-specific parameters 
should make use of the mask cubes (as discussed in Sec. 2.1) 
to extract the locations of detected objects and do their own 
processing from there. 

Note that these parameters are measured using just the 
detected pixels, since DUCHAMP does no fitting of spectral 
or spatial profiles. This means that some parameters, in- 
tegrated flux for instance, will be biased by not including 
voxels that fall below the detection (or secondary) threshold. 
The exceptions to this are W50 and W20, as these are calcu- 
lated using the full integrated spectrum, including channels 
not necessarily forming part of the detection. 



6 NOISE SUPPRESSION 

The presence of noise ultimately limits the ability to detect 
objects. This occurs in two ways: reducing the completeness 
of the final catalogue by making faint sources appear to have 
a flux erroneously below the detection threshold; and reduc- 
ing the catalogue reliability by generating false detections 
through detection of spurious noise peaks. DUCHAMP imple- 
ments several methods to minimise the effect of the noise 
and improve the reliability, which are discussed here. 

6.1 Smoothing 

One option is to smooth the data, either spectrally (via a 
Hanning filter), or spatially (via convolution with a Gaus- 
sian kernel). When the size of the filter/kernel is chosen 
appropriately, these approaches can be very effective at en- 
hancing structures of a particular scale, and suppressing the 
finer-scale noise fluctuations. They are limited, however, to 
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Figure 3. An illustration of the effect of smoothing on the detectability of sources. See text for description, (a) A 1-dimensional 
model spectrum, with five Gaussian sources (labels are referred to in the text). The central three have the same peak but different 
FWHM: 5, 15 and 45 channels. The first and last have the same integrated flux as the central peak, (b) The model spectrum with 
random noise added, drawn from a standard normal distribution. A 3<r threshold is indicated by the dotted line, and the bars under 
the spectrum indicate contiguous sources above the threshold, (c) The noisy spectrum smoothed with a width of 5 pixels, with the 
3<r threshold, as measured from the spectrum, indicated, (d) As for (c), but smoothed with a width of 15 channels, (c) As for (c), 
but smoothed with a width of 45 channels. 
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enhancing just the scale determined by the filter size. This 
becomes an a priori choice made by the user, which may be 
appropriate if searching for a particular type of signal (for 
instance, galaxies of a particular velocity width), but can 
bias the results of a truly blind search by suppressing other 
scales. 

To examine how the choice of scale influences the out- 
come, Fig. 3 shows the effect of smoothing on differently- 
sized sources (just in one dimension, for ease of represen- 
tation). The model spectrum, shown in (a) and in (b) (the 
latter with noise added), contains five sources. The flux scale 
is chosen so that the noise has a standard deviation of 1 unit. 
The three central sources #2, #3 & #4 each have the same 
peak flux of 2 with different Full Width at Half Maximum 
(FWHM) values: 5, 15 and 45 channels respectively, while 
#1 and #5 have the same integrated flux as #3, but the 
same width as #2 and #4 respectively. Panel (b) shows this 
spectrum with noise added, and with a 3cr threshold super- 
imposed. The value of a is that measured from the spectrum 
itself, and so is a little larger than 1 due to the presence of 
positive sources(in this example it is 1.11). Below the spec- 
trum we indicate with bars where we detect sources above 
this threshold. Only the first two sources are detected - the 
second perhaps due to a boost from the noise. 

Panel (c) shows the spectrum after smoothing with a 
Hanning filter of the same width as the first two sources. 
This keeps the first two sources visible, and makes enough 
structure visible in sources #3 and #4 for them to be de- 
tected, although note that #4 is fragmented. When smooth- 
ing with the width of source #2 (panel (d)), all five sources 
are able to be recovered, with source #4 now forming a single 
detection. By the time we smooth with the width of sources 
#4 and #5 (panel (e)), the weakness of #2 results in it not 
being detected - we are starting to smooth over structure on 
its scale (#1 is still detected, but its peak strength is greatly 
reduced). 

6.2 Wavelet reconstruction 

6.2.1 Algorithm 

Clearly, smoothing with a given filter imposes a choice of 
scales for which the search is sensitive, but this may re- 
sult in information on other scales (either finer or broader) 
being lost. An alternative approach is to use a multi- 
resolution wavelet transform that works over a range of 
scales, duchamp uses the a trous, or "with holes", transform 
(Starck & Murtagh 1994) to sample the structure on scales 
at logarithmically-spaced intervals. The algorithm used in 
the a trous transform is as follows: 

(1) Initialise the reconstructed array to zero everywhere. 

(2) Discretely convolve the input array with the chosen 
filter function. 

(3) Define the wavelet coefficients by taking the difference 
between the convolved array and the input array (input — 
convolved) . 

(4) Threshold the wavelet coefficients array: only keep 
those values above a designated threshold, and add them 
to the reconstructed array. 

(5) Double the separation between elements of the filter 
function, creating the holes in the filter coverage alluded to 
by the transform name. 



(6) Repeat from step 2, using the convolved array instead 
of the input array. 

(7) Continue until the required maximum number of 
scales is reached. This is defined by the largest scale where 
the filter function size is not greater than the array size. 

(8) Add the final smoothed (i.e. convolved) array to the 
reconstructed array. This provides the "DC offset" , as each 
of the wavelet coefficient arrays will have zero mean. 

The doubling of the filter scale means that a range of 
scales from the finest to the broadest are sampled in cre- 
ating the reconstructed array. The a trous transform is a 
so-called "redundant" transform. This means that the sum 
of all wavelet arrays, plus the final smoothed array, results 
in the input array. Symbolically, denoting the input and out- 
put arrays by / and O, and the smoothed and wavelet arrays 
at scale i by Si and Wi, this can be written 

O = ^Wi + S n 

= (I- Si) + (Si - S 2 ) + ... + {Sn-i - S n ) + s n 

= I 

Applying a threshold to the wavelet coefficients has the 
effect of only keeping those parts of the input array where 
there is significant signal at some scale. If the threshold is 
chosen well, the random noise present in astronomical im- 
ages can be suppressed quite effectively. The threshold ap- 
plied is a constant multiple (provided by the user) of the 
noise level in the wavelet array. 

The reconstruction may be done in one, two or three 
dimensions. The three-dimensional case reconstructs the full 
array at once, by defining a filter function that has the same 
profile in each of the three dimensions. A single largest scale 
is used, limited by the smallest array dimension. In the one- 
dimensional case, the ID spectrum for each spatial pixel 
is reconstructed independently, while the two-dimensional 
case sees each 2D channel map reconstructed independently. 
Examples of reconstructions with different dimensionality 
are discussed in Sec. 6.2.2 below. 

It was shown in Sec. 6.1 and Fig. 3 how smoothing re- 
duces the measured noise in a spectrum, by making pixels 
more and more correlated with their neighbours. The impli- 
cation for the wavelet reconstruction of this is to reduce the 
measured noise level at each scale, affecting how the thresh- 
old is applied. Since we need to know each scale's noise, to 
apply a fixed signal-to-noise threshold, we need to calculate 
it for each wavelet array. Rather than doing so directly, we 
scale the noise from that measured in the input array to take 
into account the increasing correlation of the pixels. 

To do this, for a given filter function, we transform a 
spectrum of zero flux save for a single unit pixel in the centre. 
By summing in quadrature the pixel values at each wavelet 
scale, we can find the scaling factor for the standard devia- 
tion. This procedure is explained in detail in Appendix A. 

These calculations depend on the pixels in the origi- 
nal array being independent. If they are not (for instance, 
the beam or point spread function of a telescope will result 
in partially correlated neighbouring pixels, or neighbouring 
channels may be partly correlated by the instrument or by 
some processing prior to duchamp analysis), taking the sum 
in quadrature is not the appropriate course of action. In this 
case, the only way to obtain the correct scaling factors would 
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Figure 4. A detailed view of what happens at each scale during the reconstruction. The left-hand column shows the spectrum 
smoothed at successively larger scales, all with the same vertical axis range. The right-hand column shows the wavelet spectra, 
being the difference between the smoothed spectrum to the left and the one preceding it (or, in the case of the first scale, the first 
smoothed spectrum and the original spectrum). Also shown on the wavelet spectra, as dotted lines, are the 3cr thresholds used in 
the reconstruction. 
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be through simulations that properly account for the corre- 
lations in the pixels. 

6.2.2 Example 

In Fig. 4, the step-by-step detail of a reconstruction is shown, 
highlighting what happens to the smoothed and wavelet 
spectra at each scale. The left hand side of the figure shows 
the spectrum smoothed at successively larger scales - the 
effects described in Sec. 6.1 are apparent here too. An 8th 
smoothed spectrum, the "DC offset" described above, is also 
computed, although, for simplicity, is not plotted (it is a 
fairly flat line and has no matching right hand side column) . 

The right hand side of Fig. 4 shows how the wavelet 
spectrum changes from scale to scale. These spectra show 
where the strongest signal (either positive or negative) can 
be found for a given scale - note how source #1 is strongest 
between scales 3 and 5, #3 strongest at scales 5 and 6, and 
#4 at 6 and 7. Also shown on the wavelet spectra are the 
thresholds used in the reconstruction. These are set at ±3er, 
where a for each spectrum is scaled in the manner described 
above from that measured in the input spectrum. Only data 
outside the dotted lines are added together (along with the 
DC offset) to create the reconstructed spectrum. 

The effectiveness of the wavelet reconstruction are evi- 
dent in, Fig. 5, where the results of reconstructing the model 
spectrum used in Fig. 3 are shown. Three wavelet threshold 
levels are used: 2, 3 (as used in Fig. 4) and 4 a. Clearly, 
as the threshold is increased, more of the fine-scale noise is 
removed. Note, however, that the sources, particularly the 
fainter, narrower ones, get affected as well, so that the 4a 
reconstruction starts to lose sources #2 and #3. 

In setting the threshold, the wavelet reconstruction re- 
quires a different approach to the smoothing case. A sig- 
nal to noise threshold is determined based on the statistics 
measured from the residual spectrum. If the wavelet recon- 
struction worked perfectly, this would contain only the noise, 
and so would give the exact measurement of the image noise. 
But the threshold is applied to the reconstructed spectrum, 
which, ideally, would hold just the sources of interest. Since 
there is (ideally) no noise present, the threshold can be set 
much lower than one would otherwise consider. The example 
in Fig. 5 shows a la threshold applied to all cases. When ap- 
plied to the raw spectrum, a vast number of clearly spurious 
sources is returned, but the reconstructions remove the bulk 
of these spurious sources. This can be taken to extremes - a 
0.2a detection threshold even allows recovery of source #5 
for all cases, with only 4 and 1 additional (spurious) sources 
for the 3<r and 4a reconstructions respectively. 

Since the residuals have had structures that are not 
noise removed, estimating the noise properties from them 
provide much better estimates. The values for the la thresh- 
old in each of the spectra in Fig. 5 are 1.25 for the input 
spectrum, compared to 0.92, 1.03 and 1.06 for the 2a, 3a 
and 4<j results respectively (these values include the mean - 
the robust estimates for the standard deviation of the noise 
are 1.11 for the input, and 0.89, 0.99 and 1.03 for the re- 
spective reconstructions). 

In Fig. 6, we show the effect of changing the dimen- 
sionality of the reconstruction. For this, we use real data 
from the HIPASS survey (Barnes et al. 2001). Shown is 
a single spectrum, taken from the HIPASS cube #201 at 



RA = 6 h 12 m 2=9, Dec. = -21° 42' 24" 8. This position is 
chosen to highlight two galaxies at velocities of 861 km s^ 1 
(UGCA 120) and 2319 km s" 1 (NGC 2196) which have quite 
different spectral profiles - this highlights how the recon- 
struction recovers structure on different scales. Three re- 
constructions are performed, using the one-, two- and three- 
dimensional algorithms. Clearly, all three approaches recover 
both galaxies, while removing a large proportion of the noise. 

There are differences, however. There is a larger de- 
gree of channel-to-channel noise remaining in the 2D re- 
construction, and it appears that more of the input spec- 
trum is included in the reconstructed spectrum (see around 
v ~ 0km s _1 and in the range of the broader galaxy). This 
appears to be largely due to the way the reconstruction 
has been done. The 2D reconstruction works on each chan- 
nel map separately, so each channel in the extracted spec- 
trum has been calculated independently, depending not on 
its neighbouring channels but the structure in the channel 
map. The channel maps are also where the largest pixel- 
to-pixel correlations exist (that is, within a beam), which 
allows beam-sized noise fluctuations to rise above the recon- 
struction threshold more frequently than in the ID case. 

The 3D case, while still containing more noise than the 
ID (for the same reason - it is also seeing the spatially- 
correlated beam noise), does seem to give a better- looking 
reconstruction of the shape of both the galaxies and the 
spectral baseline (the sharp negative features next to the 
galaxies in the ID spectrum are less well-defined in the 3D 
spectrum). The 3D reconstruction uses the full range of in- 
formation for a given pixel, so that structure in all three 
directions contributes to the information about its recon- 
structed value. 



7 EVALUATING duchamp 
7.1 Testing 

The aim of this section is to provide an indication of the rel- 
ative performance of the different algorithms of duchamp. 
This is done through simple tests using synthetic data purely 
to provide indicative performance. Synthetic cubes were 
used for this purpose, rather than real survey data from, 
say, HIPASS, as this prevents systematic effects that limit 
the completeness of the survey from affecting the interpre- 
tation of the completeness of the source-finder. A complete 
set of tests that is relevant for a particular survey would use 
simulations and/or real data that better match the noise 
and source characteristics expected for that survey, and so 
is beyond the scope of this paper. 

The synthetic data used for these tests were cubes of 
dimensions 128 x 128 x 512 voxels, with a grid of sources 
of varying brightnesses. We use 16 sources, ranging in peak 
brightness from 1.5 to 9 units, increasing in increments of 
0.5. Each source is spatially unresolved, so that it lies in a 
single pixel, and has a spectral shape of a Gaussian with a 
5 channel FWHM. The locations of the sources are chosen 
so that there is at most one source in a given channel map 
or spectrum. 

The cube then has noise added to each voxel, sampled 
from a normal distribution with a standard deviation of 1 
flux unit (meaning the flux units in the cube are essentially 
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Figure 5. An illustration of the wavelet reconstruction, using the same data as for Fig. 3, with three different values of the threshold 
applied to the wavelet components, (a) The same model spectrum as used in Fig. 3a), with the sources marked, (b) The input 
spectrum (again identical to Fig. 3) with a la detection threshold indicated by the dotted line. The resulting detected sources are 
shown under the spectrum. The remaining three panels show wavelet reconstruction using a reconstruction threshold of: (c) 2a (d) 
3a (the results of the reconstruction shown in detail in Fig. 4) and (c) la. 
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Figure 6. A comparison of wavelet reconstruction with different dimensionality, (a) The input spectrum (b) ID reconstruction 
(c) 2D reconstruction (d) 3D reconstruction. For the reconstructed spectra, the bottom graph shows the residual (input — re- 
constructed). The same wavelet threshold, 3cr, was used in each case. The input data is a cube from HIPASS (cube #201, at 
RA = 6 h 12 m 2f 9, Dec. = -21° 42' 24'.'8) that shows two galaxies in the one spectrum. Only a single ID spectrum is shown (from 
a single spatial pixel), but the full cube was used in the reconstruction. The vertical axis limits have been chosen to highlight the 
majority of the spectrum, but this means the bright Galactic HI at channel 100 is truncated. 



units of a), and is then convolved with a Gaussian beam that 
has a 3-pixel FWHM. This cube is then searched for sources. 
This process is repeated 100 times for different realisations 
of the noise, to provide an idea of the average completeness 
and reliability of the algorithms. 



We also make a reference cube without noise, but with 
the same beam. This provides the "truth" result - what 
would be found in the ideal case of no noise. This refer- 
ence cube is searched both to a depth of 0.01 (providing the 



full extent of sources) , and to the same depth as searches in 
the noisy cubes. 

The cubes produced in this manner are then able to be 
searched using different duchamp parameter sets. Each pa- 
rameter set will correspond to a different mode of operation: 
basic searching, smoothing or wavelet reconstruction, with 
different requirements on the minimum number of spatial 
pixels & spectral channels, as well as different smoothing 
widths or reconstruction thresholds. Each parameter set is 
then applied with a range of input signal-to-noise thresh- 
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Source peak flux 

Figure 7. Comparison of the completeness as a function of the 
peak flux of sources for different detection thresholds and different 
rejection techniques. The solid lines show searches at (from left- 
to-right) 4cr, 5(7 and 6(7, with no restriction on the size of objects. 
The dashed lines show searches at 3cr, 4(7 and 5c, rejecting sources 
that span only one channel. The reliability (R, the fraction of 
detected sources that are real) are indicated for each search type. 

olds: 3, 4, 5, 6, 7 and 8. A given signal-to-noise threshold 
will provide different effective flux thresholds when different 
pre-processing is done: Note that the different pre-processing 
will result in different effective flux thresholds for a given 
signal-to-noise: in the smoothing case, the noise is measured 
from the smoothed cube, where it is lower due to the corre- 
lated pixels, and the threshold is applied to the same cube; 
while in the reconstruction case the noise is measured from 
the rejected pixels (which ideally will indeed be noise), and 
the threshold is applied to the reconstructed pixels. 

Searching each of the 100 cubes, and then cross- 
matching with the truth result, allows us to find the average 
number of times each source is detected, subject to the noise. 
In this way, a completeness curve can be constructed for a 
given search threshold, showing the fraction of sources de- 
tected as a function of peak source brightness. We can also 
determine the reliability for a given search threshold, being 
the fraction of detected sources that correspond to a true 
source. 

The reliability measure here is implicitly dependent on 
the image size. In the case of these simulations, a larger 
image size would result in more noise detections with no 
additional true sources, leading to a decrease in reliability. 
The key aspect to look at when examining the results of 
the tests is the different dependence of the reliability on the 
detection threshold from test to test. 



7.2 Evaluation of Results 

In Fig. 7, we compare different searches that use no pre- 
processing, but only differing degrees of rejection by apply- 
ing minimum numbers of channels and spatial pixels. It is 
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Figure 8. As for Fig. 7, but for searches with different types of 
pre-processing: (a) smoothing with different widths: 3, 11 and 21 
pixels (b) ID wavelet reconstruction with different reconstruction 
thresholds: 2a, 3c and 4cr. In each plot, each pre-processing type 
has its own line style, while each detection threshold has its own 
symbol, allowing comparison of the same detection threshold used 
with different pre-processing. 

apparent that applying even a small degree of post-finding 
rejection (rejecting sources that span only a single channel), 
the reliability greatly improves, from 4% for a 4cr search to 
95%. The completeness, however, does reduce: the complete- 
ness level for the same signal-to-noise threshold increases by 
about 1 unit. This is due to single-channel matches that 
align with the true sources - many of these are probably 
chance coincidences of noise peaks (which are more likely 
to be single-channel sources) , and so applying this rejection 
will have improved the quality of the resultant catalogue. 
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Furthermore, the decrease in the number of spurious 
sources means that deeper searches can be done. The deepest 
one can go with less than 1 source in 10 spurious is a 6a 
search when accepting all sources, but by rejecting single- 
channel sources one can use a 4a search and get at least 1 
flux unit deeper for the same completeness. 

Fig. 8 compares a few different types of preprocessing, 
together with a basic search rejecting single-channel sources. 
Both spectral smoothing and one-dimensional wavelet re- 
construction are considered, each with a range of the critical 
parameter (smoothing width or reconstruction threshold). 

Fig. 8(a) shows the spectral smoothing results, using 
widths of 3, 11 and 21 channels. These widths span scales 
below and above the spectral widths of the sources (Gaus- 
sians with FWHM of 5 pixels). The 11-pixel case does the 
best in terms of completeness, as it is most closely matched 
to the shape of the sources. 

Fig. 8(b) shows the results from wavelet reconstruction 
case. These are slightly more complicated - as was seen in 
Sec. 6.2.2, increasing the reconstruction threshold allows you 
to search to deeper signal-to-noise thresholds, and increases 
the reliability at the expense of completeness for a search at 
a given threshold. The symbols in Fig. 8(b) relate to the de- 
tection threshold, while the lines relate to the reconstruction 
threshold. 

For the particular simulation under consideration, 
where the sources have a single spectral size, the smoothing 
approach does slightly better than the wavelet, although the 
difference in the completeness level is only about half a flux 
unit. These tests indicate that getting complete below about 
4a is very hard, even with ideally behaved noise. 

7.3 Processing time 

The other consideration in choosing a processing method is 
the time required. This is particularly the case in a survey, 
where a large number of cubes may have to be searched, 
or in the case where a number of different search types are 
required. While the utility of providing execution times is 
debatable, given the range of capabilities that are available 
to researchers, being able to compare execution times for the 
same type of search with different processing options can at 
least provide some idea of the relative processing load. 

To this end, duchamp was run using a MacBook Pro 
(2.66GHz, 8MB RAM) on the HIPASS cube #201 (of size 
170 x 160 x 1024 voxels, or 53 MB), with detection thresh- 
olds of both 10 8 Jy beam -1 (no sources will be found, so 
that the time taken is dominated by preprocessing), and 
35 mjy beam -1 (or ~ 2.58a, chosen so that the time taken 
will include that required to process sources). The basic 
searches, with no pre-processing done, took less than a sec- 
ond for the high-threshold search, but between 1 and 3 min 
for the low-threshold case - the numbers of sources detected 
ranged from 3000 (rejecting sources with less than 3 chan- 
nels and 2 spatial pixels) to 42000 (keeping all sources). 

When smoothing, the raw time for the spectral smooth- 
ing was only a few seconds, with a small dependence on the 
width of the smoothing filter. And because the number of 
spurious sources is markedly decreased (the final catalogues 
ranged from 17 to 174 sources, depending on the width of 
the smoothing), searching with the low threshold did not add 
much more than a second to the time. The spatial smoothing 



was more computationally intensive, taking about 4 minutes 
to complete the high-threshold search. 

The wavelet reconstruction time primarily depended on 
the dimensionality of the reconstruction, with the ID taking 
20 s, the 2D taking 30 - 40 s and the 3D taking 2 - 4 min. 
The spread in times for a given dimensionality was caused 
by the different reconstruction thresholds, with lower thresh- 
olds taking longer (since more pixels are above the threshold 
and so need to be added to the final spectrum). In all cases 
the reconstruction time dominated the total time for the 
low-threshold search, since the number of sources found was 
again smaller than the basic searches. 



8 CONCLUSIONS 

In response to the growing challenges of accurate and reliable 
source extraction in large 3D spectral-line datasets, I have 
developed duchamp, a stand-alone source- finding package. 
Since the noise is the limiting factor in source extraction, 
duchamp uses robust methods to accurately estimate the 
noise level, and provides pre-processing options to prevent 
the cataloguing of as many noise-generated spurious sources 
as possible. 

duchamp has been designed for use with spectral-line 
surveys in general, and Hi surveys in particular, but can 
readily be applied to other types of data. The core func- 
tionality of duchamp is already forms the basis for the 
prototype source finder for the ASKAP processing pipeline, 
and is being evaluated, along with other source-finding ap- 
proaches, by the ASKAP Survey Science Teams for inclu- 
sion in the final pipeline software, duchamp is maintained 
independently, however, and is provided as a separate, open- 
source software package. 
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9 http:/ /www. astro. caltech.edu/~tjp/pgplot/ 
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11 http://www.atnf.csiro.au/peoplc/mcalabrc/WCS/index.html 
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APPENDIX A: HOW GAUSSIAN NOISE 
CHANGES WITH WAVELET SCALE 

As discussed in Section 6.2.1, the d trous algorithm requires 
that the thresholds applied at each scale are equivalent. The 
threshold is defined as a multiple of the standard deviation of 
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Table Al. Standard deviation scaling coefficients for three dif- 
ferent wavelet filter functions, when used in ID, 2D and 3D situ- 
ations. The coefficients defining each filter are shown at the top 
of each column. 



the original spectrum, scaled appropriately for each wavelet 
scale. 

However, since the wavelet arrays are produced by con- 
volving the input array by an increasingly large filter, the 
pixels in the coefficient arrays become increasingly corre- 
lated as the scale of the filter increases. This results in the 
measured standard deviation from a given coefficient array 
decreasing with increasing scale. To calculate this, we need 
to take into account how many other pixels each pixel in the 
convolved array depends on. 

To demonstrate, suppose we have a 1-D array with N 
pixel values given by Fi, i — 1,...,N, and we convolve it 
with the B3-spline filter, defined by the set of coefficients 
{1/16, 1/4, 3/8, 1/4, 1/16}. The flux of the ith pixel in the 
convolved array will be 

F[ = ^F t _ 2 + IfU + |f 4 + i Fi+1 + ±F +2 

and the flux of the corresponding pixel in the wavelet array 
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will be 

Wi = F i -Fl = ^F i . 2 -\F i . 1 + lE i -\F i+1 -^ +2 

Now, assuming each pixel has the same standard deviation 
CT; = a, we can work out the standard deviation for the 
wavelet array: 




= 0.72349 a 



Thus, the first scale wavelet coefficient array will have a 
standard deviation of 72.3% of the input array. This proce- 
dure can be followed to calculate the necessary values for all 
scales, dimensions and filters used by duchamp. 

Calculating these values is clearly a critical step in per- 
forming the reconstruction. The method used by Starck & 
Murtagh (2002) was to simulate data sets with Gaussian 
noise, take the wavelet transform, and measure the value 
of a for each scale. We take a different approach, by calcu- 
lating the scaling factors directly from the filter coefficients 
by taking the wavelet transform of an array made up of a 
f in the central pixel and 0s everywhere else. The scaling 
value is then derived by taking the square root of the sum 
(in quadrature) of all the wavelet coefficient values at each 
scale. We give the scaling factors for the three filters avail- 
able to duchamp in Table Al. These values are hard-coded 
into duchamp, so no on-the-fly calculation of them is nec- 
essary. 

Memory limitations prevent us from calculating factors 
for large scales, particularly for the three-dimensional case 
(hence the smaller table). To calculate factors for higher 
scales than those available, we divide the previous scale's 
factor by either \/2, 2, or \/8 for ID, 2D and 3D respectively. 



APPENDIX B: EXAMPLE OF AN OUTPUT 
CATALOGUE 

Fig. Bl shows an example of the ASCII output table pro- 
duced by duchamp. This table comes from the HIPASS cube 
#201, processed with ID wavelet reconstruction (with a 3er 
reconstruction threshold) and searched to a 4cr threshold (to 
provide a small catalogue for demonstration purposes). 
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Figure Bl. Example of an ASCII-format output catalogue from DUCHAMP. Note that the columns starting with X, Y, and Z are 
all in pixel units. The file has been split into three sections to enable it to fit on the page. 
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